Dynamic risk prediction of survival in liver cirrhosis: A comparison of landmarking approaches

Electronic health records (EHR) data provides the researcher and physician with the opportunity to improve risk prediction by employing newer, more sophisticated modeling techniques. Rather than treating the impact of predictor variables on health trajectories as static, we explore the use of time-dependent variables in dynamically modeling time-to-event data through the use of landmarking (LM) data sets. We compare several different dynamic models presented in the literature that utilize LM data sets as the basis of their approach. These techniques include using pseudo-means, pseudo-survival probabilities, and the traditional Cox model. The models are primarily compared with their static counterparts using appropriate measures of model discrimination and calibration based on what summary measure is employed for the response variable.


Introduction
In modern medicine, the collection and storage of patient health data no longer restricts physicians or clinicians to using snapshot measurements in predicting health trajectories.Rather, the availability of comprehensive, longitudinal sequences of risk factors and biomarker changes has served as the impetus for innovation in risk prediction methodology.Leveraging this wealth of electronic health records (EHR) data has become a focal point in the medical and bio-informatics literature [1][2][3][4].Simple statistical tools can be applied to these vast EHR data sets for meaningful analysis, but taking full advantage of the longitudinal nature of these data should lead to improved prediction of health outcomes in personalized settings.The practice of incorporating health histories through time-dependent covariates in modeling is known as dynamic risk prediction (DRP).There is substantial variation in the literature on DRP ranging from methods incorporating multi-linear sparse logistic regression [5], deep neural networks [6], landmarking [7], joint modeling [8], and many other statistical techniques.
The landmarking (LM) technique presents itself as desirable for its methodological simplicity and flexibility.The LM method has emerged as a popular option when attempting to dynamically predict outcomes in an EHR setting, updating specified risk prediction models at selected LM times [9,10].Some methods create a variable that measures the change in biomarker measurements from one LM time to the next [10], while others simply incorporate a time variable to interact with selected covariates across the different LM times [11].While each method can be advocated for based on varying statistical or philosophical advantages, the underlying goal is leveraging longitudinal EHR data to improve the predictive accuracy of relevant health events for patients.This paper takes a simplified approach by incorporating an interaction between the selected covariate levels and a time variable.
Naturally, an equally important component of a DRP model along with covariates is the summary measures for the outcome.Several authors have used landmarking to dynamically model survival probability using a Cox Model [9,10], while others have explored the pseudoobservation (PO) approach both through survival probability [12] and restricted mean survival time (RMST) [11].We take a look at both survival probability and RMST and attempt to compare their relative value in an applied setting.With different outcome measures come different measures of prediction accuracy, making any direct comparison potentially misleading.Thus, direct comparison between dynamic methods is avoided, and we focus on comparing the performance of the dynamic models with their static counterparts for each summary measure.
The medical paradigm in which these DRP models will be assessed is the overall survival of patients with liver cirrhosis.Cirrhosis is a leading cause of mortality in the United States, implicated in the death of more than 50,000 patients each year, a number similar to diabetes and pneumonia [13][14][15] and exceeding most cancers [16,17].It remains imperative among hepatologists to identify new ways to maximize the wealth of longitudinal EHR data to improve the prediction of survival outcomes for cirrhotic patients.
The goal of this paper is to compare several different DRP modeling techniques based on the utilization of LM data sets.Specifically, we aim to compare the employment of PO or Cox regression for modeling survival outcomes for subjects diagnosed with liver cirrhosis.These DRP models can all utilize the LM framework but differ in terms of how the survival outcome is summarized (survival probability or RMST) as well as what types of models are employed (Cox model or generalized estimating equations).In Section 2 we will describe the modelbuilding process for each method we intend to compare as well as the data used in training and testing our models.In Section 3 we explore the model performance based on various metrics such as C-statistics and Brier Scores.And, in Section 4 we discuss the various advantages and disadvantages of the modeling procedures.

Methods
The first stage in our modeling process is the construction of LM data sets to be used in model fitting.This process is relatively straightforward and begins with selecting a set of LM times ℓ j .We chose to select a set of 11 landmark times that take place every 0.5 years of follow-up (6 months) for our data analysis.Then, at each LM time ℓ j we construct a data set L j that excludes subjects who experienced their event or were censored prior to ℓ j .
Once the LM data sets are created they can be used in fitting various types of DRP models.Either the LM data sets L j can be combined ("stacked") and used in fitting a "super-model" [11], or a new model can be constructed at each ℓ j [10].Due to the issue of censoring, many of the standard modeling techniques used in dynamic risk prediction (DRP) are not tenable.In this paper, we will compare the use of PO with the standard use of a Cox proportional hazards model.

Pseudo observation approach
The basic concept behind the PO approach is to overcome the issue of censoring by creating a measurement of survival for every subject regardless of event or censoring status.This measurement is created through a transformation of the time-to-event data taking the following form: where Z is the global survival outcome of interest and Z −i is the outcome of interest computed while excluding the i th subject.A PO of this form can be computed for all subjects and can roughly be interpreted as the subject-specific contribution to whatever survival metric is being used.Then, this measurement can be utilized as an outcome variable in a generalized linear model (GLM) [18].
A PO can, in theory, take several different forms, but in application, pseudo-means (PM), and pseudo-probabilities (PP) are intuitive and useful.Due to the challenges of estimating overall means in survival analysis in the presence of censoring, we will be using RMST and computing pseudo-RMST values as outcome variables.The term pseudo-mean will be used to indicate a pseudo-RMST value for the duration of the paper.
RMST, among various other mean-based statistics, has become a regular fixture in survival analysis when considering how to circumvent the issues of modeling and interpreting survival data when hazards are expected to be non-proportional (NPH) [19][20][21][22][23]. RMST can be computed and interpreted as the mean survival time from the beginning of follow-up to some time horizon τ.Thus, it estimates the expected survival duration of a patient until τ.This allows physicians and patients to consider survival outcomes in terms of days, months, and years of life gained or lost based on important covariates.The purpose for using RMST rather than overall mean survival is to account for the limited follow-up time due to censoring.In time-toevent analysis, many subjects won't have their event time recorded if they are either lost to follow-up or the study ends prior to their event occurring.Thus, overall means cannot be directly estimated and may not be identifiable, though various restricted means can.
A restricted mean, mðtÞ ¼ R t 0 SðtÞdt, is graphically represented as the area under the survival curve from randomization (t = 0) to some time horizon (t = τ).In practice, μ(τ) can easily be estimated by computing the area under the Kaplan-Meier (KM) curve between 0 and τ as follows: where t i is the i th event time, t * iþ1 ¼ minðt iþ1 ; tÞ, and ŜðtÞ is the KM estimate of its respective survival function.
The subject-specific PMs to be used as the response variable Y i (τ) in a GLM or GEE, where τ is the restriction time, is computed as: where mðtÞ is the restricted mean computed based on the KM curve with all subjects and mÀ i ðtÞ is the restricted mean computed based on the KM curve excluding subject i as in Eq 2. This particular technique was popularized by PK Andersen [24,25], and has been utilized by several others in DRP modeling in survival analysis [11,26,27].In our case, we compute the PM based on the conditional KM curve, where survival up to time ℓ is the condition.This PM can be interpreted as subject i's individual contribution to the conditional RMST.
These PMs are then used as the outcome variable in a GEE in order to model the covariate effects on survival.Hence, we can model the effects of covariates on the PMs as: where α j is the intercept, β(ℓ j ) = (β 1 (ℓ j ), β 2 (ℓ j ), . .., β p (ℓ j )) are the coefficients of the p predictor variables, X i (ℓ j ) = (X i1 (ℓ j ), X i2 (ℓ j ), . .., X ip (ℓ j )) are the values of the covariates collected until time ℓ j , and g() is a link function.Ultimately, we used the identity link function.
Interactions between covariates and ℓ were included in the models.For instance, the parameter β(ℓ) = (β 1 (ℓ), β 2 (ℓ), . .., β p (ℓ)) could be a vector of functions that describe changes in the covariate effects according to variations of β p (ℓ) = β p0 + β p1 ℓ + β p2 ℓ 2 .If one were to apply some model selection criteria to model coefficients, the linear time interaction, or the squared time interaction might be excluded.While others who have used this approach to dynamically modeling EHR data have performed model selection through some type of step-wise technique, we will use a full model for the sake of comparability between our different modeling approaches.Functions from "geepack" in R were used for fitting the GEE.
When using a PP rather than a PM, one must modify Eq 1 to replace the RMST estimates mðtÞ and mÀ i ðtÞ with estimated survival probabilities pðtÞ and pÀ i ðtÞ at time τ based on the KM curve.The remaining modeling techniques will remain the same, though the interpretation of coefficients must be altered appropriately.

Cox proportional hazards model approach
The LM approach has been applied through the use of the well-known Cox model also.[9,10] An obvious advantage to the Cox model is familiarity with statistical software packages, though its reliance on the proportional hazards assumption remains to be a relevant downside (especially when incorporating many predictor variables in modeling).
The dynamic Cox model can be expressed as: where Λ 0 (τ j ℓ j ) is the baseline cumulative hazard at time τ given survival to ℓ j and α j is a vector of regression parameters to be estimated during model fitting.Estimates of α j can be obtained by maximizing the partial likelihood as would be done in the static case.
Either a new Cox model can be fit at each LM time with its corresponding LM data set, [10], or a "super-model" can be constructed using a stacked LM data set.[9] In this paper, we have chosen to use the method of stacking data sets based on the results of Keogh et al. in which the models fitted from stacked data sets had better performance.

Static models
In order to assess the value of each dynamic model, we also created several static models for comparison.The first, and most basic static model fits a baseline model with covariates available at time t = 0, and uses baseline covariates obtained at time t = 0 for survival prediction at time t = ℓ + τ, conditional on survival up to ℓ.Hence, if we set τ = 3, then at landmarking time ℓ = 3 we would be comparing the dynamic models' 3 year survival predictions (using covariates at t = 3) against this static model's 6-year survival predictions conditional on survival up to year 3 (using covariates at t = 0).This was the static model utilized by Yang et al. (2021) and will be referred to as SM1 [11].
The next static model constructed also fits a single model based on subjects' baseline characteristics at time t = 0 but performs survival prediction based on subject covariates collected at time t = ℓ.Essentially, a baseline model is constructed, but updated covariates are used with this model in performing prediction at t = τ.Though this model uses updated patient information, it is not dynamic since it doesn't take into account any element of patient history, only the patients' current covariate values.This modeling approach will be referred to as SM2.
The final static model used for comparison fits a new prediction model at each landmark time (rather than only one at baseline) and also performs predictions based on the most recently available covariates at landmark time ℓ.This model may seem like it is encroaching on "dynamic" territory, but it should still be considered static as the model only accounts for covariates measured at a specific time point rather than a patient's health history or progression of health indicators.This final modeling approach will be referred to as SM3.

Data
We used the Chicago Area Patient-Centered Outcomes Research Network (CAPriCORN) data which capture the diverse population of the greater Chicago metropolitan area.The dataset encompasses 30 hospitals, and 10 health systems, totaling 12.8 million patients, which allows us to model and assess mortality in patients with liver cirrhosis.The longitudinal nature of this EHR data allows us to compare various dynamic and static prediction models.The CAPriCORN data merges institutional databases while protecting patient health information.The CAPriCORN data was pulled for use by the authors on September 8, 2022, and was fully de-identified prior to any author accessing it.
There were numerous possible predictor variables available in the CAPriCORN data.We chose a subset of possible predictors based on relevant hepatological factors.The time-invariant variables utilized were basic demographic factors such as sex and age.The other important baseline variables are known as "cirrhosis etiology" which is a categorization of the cause of liver cirrhosis (e.g.alcohol use disorder or primary biliary cirrhosis).An individual can have multiple etiologies, so the three etiologies considered in this paper were coded as a 1 if a subject ever was diagnosed with that etiology and a 0 otherwise.The time-dependent covariates associated with liver health were the various relevant lab measurements of liver health (meld-sodium score and albumin level), and certain liver-related health events, known as decompensating events.Also, hepatocellular carcinoma was included as a time-dependent covariate.
The need for written informed consent and ethics approval was waived by the Northwestern University institutional review board due to the retrospective nature of the study.The data were de-identified by the National Patient-Centered Clinical Research Network prior to our use.All methods were carried out in accordance with relevant guidelines and regulations.

Model assessment
In order to assess the accuracy of the prediction models, we evaluate both discrimination and calibration.We measure discrimination in our models by using Harrell's C-index as a form of determining how well each of our dynamic and static models assigns risk based on concordance [28].This measure is computed as the area under the receiver operating characteristic curve and is thus commonly known as AUC.For the general case, AUC can be computed as: where η i indicates the risk score at the prediction time time t = τ that can be substituted for mi in the RMST case or pi in the survival probability case.The indicator functions, such as 1 T j <T i , are 1 when the specified condition is met and 0 otherwise (δ j is an event indicator).The AUC can range from 0 to 1 with scores closer to 1 indicating better prediction accuracy.Technically, scores near 0.5 indicate the worst prediction accuracy, as it is akin to flipping a coin.When measuring calibration (the alignment between observed and predicted survival outcomes), we can take a uniform approach across the different modeling methods, but cannot necessarily make direct comparisons between them.In all cases, we use a Brier score or a slightly modified version of a Brier score for models using PM.Generally, the Brier score is used for measuring the accuracy of probabilistic predictions, such as with Cox models.The standard form for a Brier Score, in the absence of censoring, is as follows: where 1 T i >t is an indicator function that is 1 when the observed time for subject i is greater than the prediction time t.The Brier Score ranges from 0 to 1 with scores closer to 0 indicating better prediction accuracy.
We use a version of a Brier score that incorporates inverse probability censoring weights (IPCW) in order to accommodate for the censoring in our data.Thus, we compute the Brier Score as: where ĜðtÞ is the estimator of the conditional survival function of the censoring times calculated using the KM method, 1 T i �t;d i ¼1 is an indicator function that is 1 when the observed time for subject i is less than or equal to the prediction time t but is an event time, and 1 T i >t is an indicator function that is 1 when the observed time for subject i is greater than the prediction time t.
In the PM case, since survival probabilities are not being estimated, we have the following form for our "Brier Score": This resulting "Brier Score" does not have an upper bound of 1, and is not directly comparable to any Brier Score computed based on survival probability, though lower scores do indicate better performance.
For model assessment, we utilized 75% of the available data set for training our model and the remaining 25% for model assessment.

Results
The size of the entire CAPriCORN data set available for model building was n = 7040.In the overall cohort (n = 7040), there is a median follow-up time of 1.12 years ranging from 0.0027 to 11.01 years to death or loss-to-follow-up.During the follow-up period, 1371 subjects (19.47%) died.The overall 3-year survival rate was 28.13%, and the 8-year survival rate was 3.98%.Dynamic and static models were constructed with a τ = 3 years with landmark times of ℓ = 0, 0.5, . .., 4.5, 5 years.Table 1 displays the coefficients with their standard errors and pvalues included in the dynamic models, and Table 2 displays the results from the SM2 static model fit using baseline covariates for the sake of comparison.
From Table 1 it is clear that the time-varying measures of MELD-Sodium score, Albumin level, and the number of decompensating events are highly statistically significant in the dynamic model, while the baseline covariates of age and etiology of alcoholic cirrhosis (ETOH) are also highly significant.From Table 2 we see that the same predictor variables are significant.The time function used in this modeling procedure fits a linear and a quadratic interaction effect between time and the eight covariates of interest.It adds flexibility for modeling the effects of the covariates.A significant p-value with any of the interaction effects indicates a significant relationship between the specific covariate and time (either linear, quadratic, or both).More complicated interactions with time are possible (e.g., splines), though they extend outside the scope of this paper.
The dynamic coefficient of any covariate used in this modeling procedure can be calculated by the following formula: For example, in the PM model at LM time ℓ = 5, the covariate effect of having a decompensating event is −0.151 + 0.64 * 1 − 0.671 * 1 = −0.182.So, holding all other variables constant, patients who experience decompensated at the LM time of 5 years are expected to lose 0.182 mean survival years compared to patients who don't experience a decompensating event.
In practice, model selection criteria should be employed to potentially eliminate certain variables, including certain interaction effects.For example, it may make sense to exclude the linear and quadratic interaction effect between time and HCC status in the PM model as the pvalues for those coefficients are high.For our purposes, we wanted each of the dynamic models to be as directly comparable as possible.Hence, we included all covariates and interactions in the final models.
Fig 1 can be used to visualize the difference in 3-year conditional RMST dynamic coefficients over the 5-year LM period with 95% confidence intervals, while the same figures for the PP and Cox models can be found in the S1 Appendix.Essentially, at any specific LM (represented on the x-axis) one can see the covariate-specific effect (on the y-axis) of 3-year RMST of a one-unit increase in the variable of interest (while holding all other variables constant).Due to the inclusion of a quadratic time interaction in our model, the resulting 3-year conditional RMST dynamic coefficients will have a parabolic structure, so long as the quadratic term is not equal to 0. If we had only included a linear time interaction, we would instead see a straight line with some non-zero slope (as long as the interaction was non-zero), and had we included no time interaction they would be horizontal lines.The clinicians and physicians using the dynamic model with time interactions could then interpret the relationship between the landmarking times and the dynamic coefficients and better understand the relationship between time and the covariate-specific effect on survival.

Individual dynamic prediction
We can also utilize the dynamic model to provide individual 3-year survival predictions for patients with liver cirrhosis.Fig 2 shows the individual predictions with the dynamic and static models for two different patients based on each model type.The solid line represents the dynamic conditional 3-year survival prediction, and the dashed line represents the 3-year survival prediction based on the static model fit with baseline covariates at the beginning of the study (SM2).In a clinical setting, it can be valuable to distinguish the risk of death between patients when evaluating candidates for certain treatment options (such as transplants).In Fig 2 , patient A had a relatively high meld score through the duration of follow-up, experienced multiple decompensating events, and was diagnosed with hepatocellular carcinoma (HCC) at year 1, whereas patient B had a relatively low meld score until roughly year 4, experienced no decompensating events, and was never diagnosed with HCC.Notably, patient A experiences a large decrease in their predicted 3-year survival from ℓ = 0.5 to ℓ = 1 (dropping from 2.58 years to 2.27 years and from 2.46 years to 1.99 years in the dynamic and static RMST models respectively) when they were diagnosed with HCC.We also see that the dynamic models are not as strongly influenced by risk factors that take place at a specific time point.For example, when patient B experiences an increase in meld score around year 4, the static model predicts a sharper decrease in survival (2.21 years in the RMST model, 64.80% in the PP model, and 60.95% in the HR model) than does the dynamic model (2.64 years in the RMST model, 77.55% in the PP model, and 97.99% in the HR model) which incorporates the entire history of meld scores.In the event a clinician wishes to evaluate different candidates for transplant at different time points, comparing risk prediction could serve as a useful tool.

Model performance
As described in Section 2.5, the quality of each dynamic model's performance was measured in terms of discrimination (AUC) and calibration (prediction error or Brier score) and compared  When considering the calibration of these models, no static model appears to perform better across all 3 modeling procedures than their dynamic counterpart.In the PM model, the crude static model has a better prediction error than either of the other static models but has the worst Brier score in the PP model.Likewise, SM2 has the worst prediction error in the PM model, but the best Brier score in the Cox model.Essentially, even though the dynamic model is not uniformly better than its static counterparts in any given model, it has a very similar AUC and prediction error/Brier score to the best static models.The strong performance of SM2 and SM3 might be due to their use of the most current covariate information when predicting outcomes.In many health settings, it may be the case that the most current health information does most of the leg work in terms of predicting future outcomes, and that any modeling procedure that incorporates this information will perform relatively well.
The dynamic model seems to be most advantageous when using the PM approach, and least advantageous when applying a Cox modeling procedure.Thus, determining what type of summary measure is desired when modeling should influence a researcher's choice of modeling technique (dynamic or static) in a setting similar to ours.If a summary measure is desired in terms of survival duration, then a dynamic PM model might be best.Whereas, if the prediction is desired in terms of survival probability, and a standard Cox modeling procedure is desired, then a static model fit at baseline (such as SM2) might be a reasonable choice.
From this figure, we can see that SM2 and SM3 perform well in relation to the dynamic models.This may indicate that the most current health information may be of the greatest value when performing prediction.

Discussion
Dynamic risk prediction modeling is a vast research-scape and offers medical researchers a new paradigm for dealing with EHR data.In this paper, we have specifically touched on three different modeling techniques that apply the LM approach, but this accounts for a mere fraction of the various complex methods presented in recent literature.
Overall, the LM method proves to be intuitive, straightforward, and useful.Even though the dynamic models built from LM data sets were not an overwhelming favorite in terms of model performance, they were always at least as good as their static counterparts, and better in several specific cases.
One notable downside to the LM approach is the computational time involved in constructing the LM data set.The vast majority of computing time was devoted to this facet of modeling where the cutLM() function from the "dynpred" package in R was used.For smaller data sets, this step would not be too intensive, but with EHR data sets containing numerous subjects, this could become a process that requires extremely powerful computing equipment.On the other hand, simple static models could be rapidly built with little resources required.
Alternatively, a modeling procedure based on LM data sets could be employed in a similar fashion to Parast et al.where differences in the time-varying covariates are measured across the LM times and used as a predictor variable in model fitting.This is intuitive, however, Parast et al. found only marginal improvements in model calibration and no improvement in discrimination (as compared to a static model) while applying this methodology to Diabetes Prevention Program (DPP) data [10].
Since the CAPriCORN data is not available to the public, readers could perform our analysis using the "pbc2" data in the "dynpred" package in R.This data similarly measures survival in cirrhotic subjects and has been used to test the quality of DRP methods such as those presented by Yang et al. [11].

Conclusion
The ultimate goal of dynamic risk prediction is to improve the accuracy of personalized prediction rules for individual patients.As data collection becomes easier, faster, more expansive, and more accurate, this goal can become more attainable.While the quality of the predictions from the DRP models discussed in this paper may not be overwhelming, their use offers similar performance to static models and encourages the continued advancement in statistical methodology.Given the prevalent use of risk prediction modeling in medical settings, even simple advances in prediction quality can benefit many patients in the future.

Fig 1 .
Fig 1. Difference in 3-year conditional RMST in the dynamic RMST model over the 5-year LM period.The solid line represents the dynamic coefficients β p (ℓ), or the difference in conditional 3-year RMST resulting from a single unit increase in the p th covariate at ℓ.The black dashed line represents the 95% CI. https://doi.org/10.1371/journal.pone.0306328.g001

Fig 2 .
Fig 2. Individual predictions with the dynamic and static models for patients A and B with all-cause mortality as the event of interest.The solid line represents the dynamic 3-year survival prediction, and the dashed line represents the 3-year prediction based on the static model fit with baseline covariates at t = 0 (SM2).https://doi.org/10.1371/journal.pone.0306328.g002

Fig 3 .
Fig 3. Assessment of model performance in terms of AUC and Brier score.The solid line represents the dynamic model and the dotted/dashed lines represent the 3 different static models.https://doi.org/10.1371/journal.pone.0306328.g003

Table 2 . The results of the static models fit at baseline with τ = 3 and mortality as the event of interest.
https://doi.org/10.1371/journal.pone.0306328.t002